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^ . A Monte Carlo algorithm is said to be adaptive if it automatically cali- 

O I brates its current proposal distribution using past simulations. The choice of 

' the parametric family that defines the set of proposal distributions is critical 

Q> ■ for good performance. In this paper, we present such a parametric family for 

adaptive sampling on high-dimensional binary spaces. 
^ I A practical motivation for this problem is variable selection in a linear 

^5 ' regression context. We want to sample from a Bayesian posterior distribution 

r~| . on the model space using an appropriate version of Sequential Monte Carlo, 

"j^ I Raw versions of Sequential Monte Carlo are easily implemented using bi- 

nary vectors with independent components. For high-dimensional problems, 
however, these simple proposals do not yield satisfactory results. The key 
I to an efficient adaptive algorithm are binary parametric families which take 

J> ' correlations into account, analogously to the multivariate normal distribution 

^ ■ on continuous spaces. 

I We provide a review of models for binary data and make one of them 

\G ' work in the context of Sequential Monte Carlo sampling. Computational 

studies on real life data with about a hundred covariates suggest that, on 
difficult instances, our Sequential Monte Carlo approach clearly outperforms 
standard techniques based on Markov chain exploration. 

Keywords Adaptive Monte Carlo • Multivariate binary data • Sequential Monte Carlo 
^ ' Linear regression • Variable selection 

1 Introduction 

We present a Sequential Monte Carlo (Del Moral et al., 2006) algorithm for adaptive 
sampling from a binary distribution. A Monte Carlo algorithm is said to be adaptive 
if it adjusts, sequentially and automatically, its sampling distribution to the problem at 
hand. Besides Sequential Monte Carlo, important classes of adaptive Monte Carlo are 
Adaptive Importance Sampling (e.g. Cappe et al., 2008) and Adaptive Markov chain 
Monte Carlo (e.g. Andrieu and Thoms, 2008). 
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A central aspect of adaptive algorithms is their need for a parametric family of aux- 
iliary distributions which should have the following three properties: (a) the family is 
sufficiently flexible to guarantee a reasonable performance in the context of the specific 
algorithm; (b) it allows to quickly draw independent samples; (c) it can, with reasonable 
effort, be calibrated using past simulations. 

For problems in continuous sampling spaces, the typical example is the multivariate 
normal distribution, which clearly fulfils (b) and (c), and complies with (a) in many 
practical problems. In this paper, we propose an analogue for high-dimensional binary 
sampling spaces. 

1.1 Adaptive Monte Carlo on multivariate binary spaces 

Our objective is to construct a parametric family for Sequential Monte Carlo on the 
binary sampling space B"^ = {0, l}*^, where d is too large to allow for exhaustive enumer- 
ation of the whole space B"^. Since there is no multivariate binary family which we can 
easily parametrise by its first and second order moments like the multivariate normal, 
the construction of suitable proposal distributions seems more difficult for the discrete 
adaptive sampling problem than for its continuous counterpart. 

The major application for our algorithm is variable selection in linear regression mod- 
els. In this context, a binary vector 7 € B'^ encodes whether each of d possible covariates 
are included in the linear regression model or not. In a Bayesian framework, and for a ju- 
dicious choice of prior distributions, we can explicitly calculate the posterior distribution 
vr up to a constant. 

We want to sample from this distribution in order to approximate quantities like the 
expected value E^r ('y), that is the marginal probability of inclusion of each variable. 
Often, the marginal probabilities provide a richer picture of the posterior distribution 
than a collection of modes found using stochastic optimisation techniques. 

1.2 Global versus local methods 

Our Sequential Monte Carlo approach to variable selection views a well studied problem 
from a different angle and provides new perspectives. The reason is two-fold. 

Firstly, there is growing evidence that global methods, which track a population of 
particles, initially well spread over the sampling space B*^, are often more robust than 
local methods based on Markov chain Monte Carlo. The latter are more prone to get 
trapped in the neighbourhood of local modes. We largely illustrate this effect in our 
simulations in Section 6. 

Secondly, global methods have the property to be easily parallelisable. Parallel im- 
plementations of Monte Carlo algorithms have gained a tremendous interest in the very 
recent years (Lee et al., 2010; Suchard et al., 2010), due to the increasing availability of 
multi-core (central or graphical) processing units in standard computers. 

1.3 Plan and notations 

The paper is organised as follows. 

In Section 2, we recapitulate the basics of Bayesian variable selection in linear regres- 
sion models as the motivating application. 

In Section 3, we briefly review the principal Markov chain Monte Carlo methods which 
are commonly used to integrate with respect to a binary distributions. 
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In Section 4, we describe an alternative approach to the same problem using Sequential 
Monte Carlo methods. The key ingredient of this algorithm is a parametric family which 
is flexible enough to come close to the target distribution. 

In Section 5, we extensively discuss approaches for constructing rich parametric fami- 
lies on binary spaces. This is the core of our work. Some of the binary models discussed 
are not suitable in the framework of our Sequential Monte Carlo algorithm but mentioned 
for completeness of the survey. 

In Section 6, we construct two examples of variable selection problems which yield 
challenging posterior distributions. We show that standard Markov chain techniques fail 
to produce reliable estimates of the marginal probabilities while our Sequential Monte 
Carlo approach successfully copes with the integration problem. 

Notation For a vector x S A"^, we write xm for the sub-vector indexed by M C 
{!,..., d}. We write Xi-j if the indices are a complete sequence i, . . . ,j. We denote by 
X-i the sub-vector aJ{i,...,(i}\{j}. We write \x\ for Ylk=i^k- 

For a matrix A, we denote its components by a^, its determinant by |A|. The operator 
diag [x] transforms the vector x into a diagonal matrix. For a finite set M, we denote 
hy the number of elements in M. 

2 Variable selection: A binary sampling problem 

The standard linear normal model postulates that the relationship between the observed 
explained variable y E and the observations Z = [zi , . . . , z^] E M™'"^ is 

y I /3, 7, <t', Z - (Z diag [7] f3, a^Im) ■ 

Here, /3 is a vector of regression coefficients and o"^ the variance of y. We denote by 
Im the identity matrix and assume the first column Z.^i to be constant. The parameter 
'Y g B'^ = {0, 1}"' determines which covariates are included in or dropped from the linear 
regression model. In total, we can construct 2'^ different linear normal models from the 
data. 

We assign a prior distribution 7r(/3,(T^,7 | Z) to the parameters. From the posterior 
distribution 

7T{P,a\j I y,Z)cxTTiy \ (3,a\-f,Z)7r{f3,a\-f \ Z) 
we may compute the posterior probability of each model 

7r(7 I y,Z) = J 7r(/3,a2,7 | y,Z)d{p,a^) (1) 

by integrating out the parameters /3 and o"^. 

Hierarchical Bayesian model In a purely Bayesian context, we obtain, up to a constant, 
an explicit formula for the integral in (1) by choosing conjugate hierarchical priors, that 
is a normal 7r(/3 | cr^, 7, Z) and an inverse-gamma 7r((T^ | 7, Z). For all Bayesian posterior 
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distributions in this paper, we use the prior distributions 



7r(/3 I 7, Z) = (O, a^v^di&g [7]) , cj^ > 0, 

7r(cj2 I 7,Z) =X(u;/2,Au;/2), w; > 0, A > 0, 

7r(7 I Z) = 



where I denote an Inverse-Gamma and U a uniform law. 

For our numerical examples in Section 6, we assume not to have any prior information 
about the data. We follow the recommendations of George and McCulloch (1997) and 
choose the hyper-parameters 

w = 4:.0, X = dl, v'^ = W.O/X, (2) 

where is the least square estimate of cr^ based on the saturated model. The rationale 
behind this choice is to have a flat prior on /3 and provide cj^ with sufficient mass on the 
interval ((T^,(To), where denotes the variance of y. 

Next, we quickly state the form of the log-posterior mass function. We write Z-y for 
Z diag [7] without zero columns. Let = ZJ^y and 

C^,,Cl, = ZT Z^ + v-%\ (3) 

a Cholesky decomposition. We denote the least square estimate of based on v and 
the model 7 by 

We find the log-posterior probability to be 

log7r(7 I Z) = /i - ^Jl'i logcg'"^ - I7I log(z;) 

w + m ^ , ^ , ^2 ^ 
2 — log(u;A/m + cr^ „), 

where n is an unknown normalization constant. 



Related approaches In a Frequentist framework, we choose a model which minimizes 
some specified criterion. A popular one is Schwarz's Criterion (Schwarz, 1978, also 
Bayesian Information Criterion) which basically is a second degree Laplace approxima- 
tion of (1): 

log7r(7 \y,Z) ^ fi- — log(m) - — log(a^), 

where = lim^_>>oo ^^,v maximum likelihood estimator of cj^ based on the model 

7. Note that for a large sample size m the Hierarchical Bayesian approach and the 
Bayesian Information Criterion coincide. 



Alternative approaches The posterior of a Bayesian linear regression variable selection 
problem has, in general, no particular structure we can exploit to speed up optimisation 
or integration with respect to vr. Therefore, alternative approaches such as the Least 
Absolute Shrinkage and Selection Operator (Tibshirani, 1996) have been proposed which 
draw from the theory of convex optimization and allow for computation of larger prob- 
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lems. 

While a comparison between competing approaches to variable selection is beyond 
the scope of this paper, we remark that more sophisticated, parallelisable algorithms 
are essential for making Bayesian modelling feasible in the context of high dimensional 
problems where alternative methods are often used for practical reasons only. 



3 Markov chain Monte Carlo on binary spaces 

Markov chain Monte Carlo is a well-studied approach to approximate the expected value 
of a posterior vr given by a Bayesian model choice problem (George and McCulloch, 1997). 
In this section, we rapidly review the standard methods we are going to compare our 
Sequential Monte Carlo approach against. 

There are more advanced Markov chain Monte Carlo algorithms that use parallel 
tempering ideas combined with more elaborate local moves (see e.g. Liang and Wong, 
2000; Bottolo and Richardson, 2010), but a thorough comparison is beyond the scope of 
this paper. For background on Markov chain Monte Carlo methods, we refer to standard 
literature (see e.g. Robert and Casella, 2004, chaps. 7-12). 



3.1 Framework 

The idea is to construct a transition kernel /t, typically some version of a Metropolis- 
Hastings kernel, which admits vr as unique stationary distribution. Then, the distribution 
of the Markov chain Xt+i ~ K'ixt, •) started at some randomly chosen point xq G B"^ 
converges to vr. 

We obtain an estimate (7) ^ J2^=b for the expected value via the ergodic 
theorems for Markov chains. The first b states are usually discarded to give the chain 
some time to converge towards the stationary distribution before we start to average. 
For the estimate to be valid, we need to ensure that the at time b the chain is close to 
its stationary distribution vr, and at time n -|- 6 we have sampled an ergodic trajectory 
such that the ergodic theorems applies. 

Classic Markov chain methods on binary spaces work locally, that is they propose 
moves to neighbouring models in the Metropolis-Hastings steps. A neighbouring model 
is a copy of the current model where just a few components are altered. We shall see 
that these kinds of transition kernels often fail to sample ergodic trajectories within a 
reasonable amount of time if the stationary distribution ir is very multi-modal. 



Algorithm We loop over a uniformly drawn subset of components / ^ IA{{M C 

{l,...,(i} I #M = k}) and propose to change the components i £ I. The number 
of components k might be fixed or drawn from some distribution Q on the index set 
{l,...,d}. 

Precisely, we take a copy y of the current state Xt and replace by ~ ^Pi{x) for 
all i € /, where 

Bp^^^){j)=p,{xP{l-p,{x))'-''^ 

is a Bernoulli distribution with parameter pi{x) € (0, 1). We set Xf+i = y with proba- 
bility 

7r(y) Uiei ^My)(''t) ^^ .^ 
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and Xt^i = Xf otherwise. This framework, summarized in Algorithm 1, yields a Markov 
chain with unique invariant distribution tt for any fixed p € (0, 1)'^. The interesting 
special cases, however, use a p{x) which depends on the current state of the chain. 

Algorithm 1 Generic metropolised Gibbs kernel 
Input: X eM'^ 

I ~ U{{M C {1, . . . , d} I #M = A;}) 
y -^x 

for i G / do yi Sp^(^) 

if !^n,-|-M(^>f/ then x^y 

return x 



Performance We refer to the ratio (4) as the acceptance probability of the Metropolis- 
Hastings step. In binary spaces, however, accepting a proposal does not imply we are 
changing the state of the chain, since we are likely to re-propose the current state y = Xt. 
We are actually interested in how fast the chain explores the state spaces, precisely its 
mutation probability P (s^+i 7^ Xt). 



3.2 Standard Markov chain methods 

For this section, let A; = 1 be constant. Algorithm 1 collapses to changing a single 

component. Instead of independently drawing the index i ~ . . . we could 

also iterate i through a uniformly drawn permutations (t({1, . . . of the index set 

Kernels of this kind are often referred to as metropolised Gibbs samplers, since 
they proceed component-wise as does the classical Gibbs sampler, but also involve a 
MetropolisHastings step. In the sequel, we discuss some special cases. 



Classical Gibbs The Gibbs sampler sequentially draws each component from the full 
marginal distribution, which corresponds to 

Pi{x) := 7r(7i = 1 I 7_j = x_i) 

7r(7i = 1, = x^i) + 7r(7i = 0, 7_i = x_i) ' 

By construction, the acceptance probability is 1 while the mutation probability is only 
TT{y)/{Tr{xt) + TT{y)), where y is a copy of the current state Xt with component i altered. 



Adaptive metropolised Gibbs An adaptive extension of the metropolised Gibbs has 
been proposed by Nott and Kohn (2005). The full marginal distribution vr(7j = 1 \ 
'Y^j = X-j) is approximated by a linear predictor. In their notation. 



Pi{x) :-- 



Wi 



A (1-5), 
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where is the estimated mean, W~ the estimated covariance matrix and 5 € (0, 1/2) 
a design parameter which ensures that Pi{x) is a probabihty. Analogously to our vector 
notation, W_j denotes the matrix W without the ith row and column. We obtain the 
estimates from the past trajectory of the chain xi,, . . . , Xt^i and update them periodi- 
cally. 

The mutation probability is of the same order as that of the Gibbs kernel, but adaption 
largely avoids the computationally expensive evaluations of vr. The non-adaptive Gibbs 
sampler already requires evaluation of n{y) just to produce the proposal y. In contrast, 
the adaptive metropolised Gibbs samples from a proxy and only evaluates '7r(y) '\iy ^ Xf. 



Modified metropolised Gibbs In comparison to the classical Gibbs kernel, we obtain 
a more efficient chain (Liu, 1996) using the simple form 

Pi{x) := 1 - Xi. 

Since we always propose to change the current state, the acceptance and mutation prob- 
abilities are the same. They amount to ^{y) /t:{x) A 1, where y is a copy of the current 
state X with component i altered. Comparing the mutation probabilities of the two 
kernels, we see that the modified metropolised Gibbs chain moves, on average, faster 
than the classical Gibbs chain. 



3.3 Block updating 

The modified metropolised Gibbs easily generalises to the case where k may take values 
larger than one. Suppose, for example, we propose to change 

k ^ g,, (k) a ^ ^^1— l|i,...,,}(fc) 

components simultaneously, where Qk* is a truncated geometric distribution. Note that 
we suggest, on average, to change approximately k* components. In other words, for 
larger values of k* , we are more likely to propose further steps in the sampling space. 

Large step proposals improve the mixing properties of the chain and help to escape 
from the attraction of local modes. They are, however, less likely to be accepted than 
single component steps which leads to a problem-dependent trade-off. In our numerical 
examples, we could not observe any benefit from block updating, and we do not further 
consider it to keep the comparison with our Sequential Monte Carlo method more concise. 



3.4 Independent proposals 

We can construct a fast mixing Markov chain based on independent proposals. Let q be 
some distribution with tt ^ q, that is 5(7) = 7i"(7) = for all 7 € B"^. We propose 
a new state y ^ q and accept it with probability 

— — -— -— Al. (5) 

-K^xt) q[y) 

The associated Markov chain has the unique invariant measure vr. This kernel is referred 
to as the independent Metropolis-Hastings kernel, since the proposal distribution is not 
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a function of the current state £Cf . The mutation rate is the acceptance rate minus q{xt), 
so the two notions practically coincide in large spaces. 

Obviously, in order to make this approach work, we need to choose q sufficiently close 
to vr, which implies high acceptance rates on average. In absence of reliable prior infor- 
mation, however, we are not able to produce such a distribution q. We shall, however, 
use precisely this Markov kernel as part of our Sequential Monte Carlo algorithm. In 
this context, we can calibrate sequences qt of proposal distributions to be close to our 
current particle approximation. 

4 Sequential Monte Carlo on binary spaces 

In this section, we show how to estimate the expected value with respect to a probabil- 
ity mass function 7r(7) defined on B"' using Sequential Monte Carlo (Del Moral et al., 
2006). This general class of algorithms alternates importance sampling steps, resampling 
steps and Markov chain transitions, to recursively approximate a sequence of distribu- 
tions, using a set of weighted 'particles' which represent the current distribution. In the 
following, we present a version which is tailored to work on binary spaces. 

For readers not familiar with Sequential Monte Carlo, the following algorithm de- 
scribed might seem rather complex at first glance. We introduce the steps separately 
before we look at the complete algorithm. We give comprehensive instructions which 
correspond exactly to our implementation in order to make our results plausible and 
easily reproducible for the reader. 

4.1 Building a sequence of distributions 

The first ingredient of Sequential Monte Carlo is a smooth sequence of distributions 
{'Kt)l^Q, which ends up at the distribution of interest vr,- = vr. The intermediary distribu- 
tions TTf are purely instrumental: the idea is to depart from a distribution ttq with broad 
support and to progress smoothly towards the distribution of interest vr. 

Initial distribution Theoretically, we can use any ttq with vr ^ tto that can sample 
from as initial distribution. Numerical experiments taught us, however, that premature 
adjustment of ttq, for example using Markov chain pilot runs, leads to faster but less 
robust algorithms. 

Thus, in practice, we recommend the uniform distribution for its simplicity and reli- 
ability. Therefore, in the sequel, we let ttq = U{M'^). 

Intermediate distributions We construct a smooth sequence of distributions by judi- 
cious choice of an associated real sequence {Qt)t^Q increasing from zero to one. The most 
convenient and somewhat natural strategy is the geometric bridge (Gelman and Meng, 
1998; Neal, 2001; Del Moral et al., 2006) 

TTtif) :oc 7ro(7)^"^'7r(7)^' oc 7r(7)^*. (6) 

Alternatively, one could use a sequences of mixtures 

TTti-y) :oc (1 - £'t)7ro(7) + etvr(7) 
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or, in a Bayesian context, a sequences of posterior distributions where data is added as 
Qt increases, that is 

7rt(7) = 7r(7 I ^i, • • • , ^[ftH )' 

see (Chopin, 2002). In the fohowing, we use the geometric bridge (6) for its computa- 
tional simphcity and present a procedure to determine a suitable sequence {Qt)],=Q- 



Suppose we have already produced a sample a;^* ^\ . . . , Xri of size n from vrj^i. We 



4.2 Assigning importance weights 

Suppose we have already produced a sa 
can roughly approximate ttj by the empirical distribution 

n 

vr*(7)~E^*(4*"")^.l-l(7), (7) 

k 

k=l 

where the corresponding importance function wt is 

Wt[Xk) := ^ -—, ut{x):= — = tt*{x). (8 

Note that at = Qt — Qt-i is the step length at time t. As we choose at larger, that is nt 
further from 7rj_i, the weights become more uneven and the accuracy of the importance 
approximation deteriorates. 

Procedure 1 Importance weights 



Input: a, vr, X = . . . , a;„)T 
Uk ^ for all A; = 1, . . . , n 

Wk ^ Uk/(J27=i '^i) for all A; = 1, . . . , n 
return w = {wi^ . . . ,Wn) 



If we repeat the weighting steps until we reach vr,- = vr, we obtain a classical impor- 
tance sampling estimate with instrumental distribution ttq. The idea of the Sequential 
Monte Carlo algorithm, however, is to control the weight degeneracy such that we can 
intersperse resample and move steps before loosing track of our particle approximation. 



Effective sample size We measure the weight degeneracy through the effective sample 
size criterion, see (Kong et al., 1994). In our case, we have 

The effective sample size is 1 if all weights are equal and 1/n if all mass is concentrated 
in a single particle. 

For a geometric bridge (6), the effective sample size is merely a function of a. We can 
thus control the weight degeneracy by judicious choice of the step lengths at- 



4.3 Finding the step length 

We pick a step length a such that the effective sample size 77(a) equals a fixed value 
r)*, see (Jasra et al., 2008). Since ij is continuous and monotonously increasing in a, we 
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solve 

v{a,X)=rj* (9) 

using bi-sectional search, see Procedure 2. This approach is numerically more stable 
than a Newton- Raphson iteration, for the derivative dr](a,x)/da involves fractions of 
sums of exponentials which are difficult to handle. 

Let a* be the unique solution to (9). We can construct an associated sequence setting 
Qt = I A {gt-i + a*). Thus, the number of steps r depends on the complexity of the 
integration problem at hand and is not known in advance. 

In other words, for fixed if, the associated sequence {Qt)t is a self-tuning parameter. 
In our simulations, we always choose ?]* = 0.9, which yields convincing results on both 
example problems in Section 6. Smaller values significantly speed up the Sequential 
Monte Carlo algorithm but lead to a higher variation in the results. 

Procedure 2 Find step length 

Input: Q, X = {xi, . . . ,Xny 
1^0, 1.05 - p,a^ 0.05 
repeat 

if 'r]{a, X) < rj* then u a, a {a + l)/2 

else I a, a {a + u) /2 
until |n — / |<eor/>l — ^ 
return a A (1 — g) 



4.4 Resampling the system 

Suppose we have a sample X^*"'^^ = {xf ^\ . . . , Xn of size n from 7rt_i with impor- 
tance weights as defined in (8). We can obtain a sample X^ ^ = {xf \ . . . , Xn^) which is 
approximately distributed according to vr^ by drawing from the empirical approximation 
defined in (7). 

Procedure 3 Resample (systematic) 

Input: w = {wi, . . . ,Wn), X = (a^i, . . . ,a;„)T 
V ^ nw, J ^ 1, c f 1 
sample u ~ ^/([0, 1]) 
for A: = 1, . . . , n do 
while c < u do 

j ^ j + 1, C + Vj 

end while 

Kj, n u + 1 

end for 

return X = {xi . . . , 



For the implementation of the resampling step, there exist several recipes. We could 
apply a multinomial resampling (Gordon et al., 1993) which is straightforward. There 
are, however, more efficient ways like residual (Liu and Chen, 1998), stratified (Kitagawa, 
1996) and systematic resampling (Carpenter et al., 1999). We use the latest in our 
simulations, see Procedure 3. 
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In the resulting unweighted particle system X of size n, the particles with small 
weights have vanished while the particles with large weights have bee multiplied. There 
are approaches that resample a weighted particle system of size n from an augmented 
system of size m > n, see (Fearnhead and Clifford, 2003), but these techniques are 
computationally more demanding without visibly improving our numerical results. The- 
oretically, one would expect a Rao-Blackwellisation effect but its analysis is beyond the 
scope of this paper. 

In any case, if we repeat the weighting and resampling steps several times, we rapidly 
deplete our particle reservoir reducing the number of different particles to a very few. 
Thus, the particle approximation will be totally inaccurate. The key to fighting the 
decay of our approximation is the following move step. 

4.5 Moving the system 

The resampling step provides an unweighted particle system ^ = {xf\ . . . , Xn'^) of vr^ 
containing multiple copies of many particles. The central idea of the Sequential Monte 
Carlo algorithm is to diversify the resampled system, replacing the particles by draws 
from a Markov kernel Kt with invariant measure vrt (Gilks and Berzuini, 2001). 

Since the particle x^^^ is, approximately, distributed according to vTf, a draw x^j}'^ ~ 

Kt{x^^\ •) is again, approximately, distributed according to vrj. We can repeat this pro- 
cedure over and over without changing the target of the particle approximation. 

Note that, even if the particles x^^^ = ■ ■ ■ = are equal after resampling, the 

particles x^i^\...,Xm are almost independent after sufficiently many move steps. In 
order to make the algorithm practical, however, we need a transition kernel which is 
rapidly mixing and therefore diversifies the particle system within a few steps. Therefore, 
the locally operating Markov kernels reviewed in Section 3 are not suitable. In fact, our 
numerical experiments suggest that making a Sequential Monte Carlo algorithm work 
with local kernels is practically impossible. 

Therefore, we use a Metropolis-Hastings kernel with independent proposals as de- 
scribed in Section 3.4. Precisely, we construct a kernel Kt employing a parametric family 
qg on B"^ which, for some 9, is sufficiently close to vr^ to allow for high acceptance prob- 
abilities. 

For this purpose, we fit a parameter 6t to the particle approximation {wt,^t) of nt 
according to some convenient criterion. The choice of the parametric family qg is crucial 
to a successful implementation of the Sequential Monte Carlo algorithm. We come back 
to this issue in Section 5. 

Particle diversity We need to determine how often we move the particle system before 
we return to the weight-resample step. An easy criterion for the health of the particle 
approximation X = {xi, . . . , Xn) is its particle diversity 

C(X) := #{^fel^=l^---^"i e [1/n, 1], (10) 
n 

that is the proportion of distinct particles. Note that the particle diversity is a quality 
criterion which has no simple analogue in continuous sampling spaces. 

For optimal results, we recommend to keep on moving the particle system until the 
particle diversity cannot be augmented any longer. In the first steps of the algorithm. 
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Procedure 4 Move 



X^*^^ = {xf'\...,Xn'^) targeting 

Input: ^* 

k(2/,7) such that T^til) = 

s <s- 1 
repeat 

sample x^j^^ '-^ ^i^^k ') all /c = 1, . . . , n 
until |C(X(")) - C(X(""^^)| < 0.02 or ((X^'^) > 0.95 
return X^") = {x['^ . . . ,a;^f^)T 



TTf is still close to the uniform distribution, and we manage to raise the particle diversity 
up to one. As -/r^ is approaching a strongly multi-modal target distribution vr, however, 
the particle diversity reaches a steady-state we cannot push it beyond. 

Clearly, even if we could draw a particle system independently from vr, the particle 
diversity would be a lot smaller than one, since we would draw the modes of vr several 
times. 

Aggregated weights Shifting weights between identical particles does not affect the 
nature of the approximation but it changes the effective sample size r]{w) which seems 
paradoxical at first sight. For reasons of parsimoniousness, we could just keep a single 
representative x^, for identical particles and aggregate the associated 

weights to the sum w^, = w*-^ + • • • + w^^ without changing the quality of the particle 
approximation. There are, however, three reasons why we refrain from doing so. 

Firstly, it is vital not to confuse the weight disparity induced by reweighting accord- 
ing to the progression of vrt and the weight disparity due to aggregation of the weights 
of multiply sampled states. From the aggregated system, we cannot tell whether the 
effective sample size is determined by the gap between tt^ and irt+i, that is the step 
length a, or by the presence of particle copies due to the mass of ttj being very concen- 
trated. Therefore, it seems more difficult to control the smoothness of the sequence of 
distributions and find a suitable sequence {Qt)t=o- 

Secondly, aggregation is an additional computational effort equivalent to keeping the 
particle system sorted. Here, we trade in computing time for memory while the required 
memory is proportional to the number of particles and not critical in the context of our 
algorithm. 

Thirdly, the straightforward way to implement repeated move steps is breaking up the 
particles into multiple copies corresponding to their weights and moving them separately. 
Consequently, instead of permanently splitting and aggregating the weights we might 
just allow for multiple copies of the particles. 

4.6 The Resample-move algorithm 

Finally, we summarize the complete Sequential Monte Carlo method in Algorithm 2. 
Note that, in practice, the sequence vrj = vr''* is not indexed by t but rather by pt, that 
is the counter t is only given implicitly. 
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Algorithm 2 Resample-move 



Input: vr: B'^ [0, oo) 



sample hl{W^) for all A; = 1, . . . 
a ^ find step length(0, X) 
w ^ importance weights(a, vr, X) 

while Q <1 do 



n. 



(Procedure 2) 
(Procedure 1) 



qe ^ fit binary model(w, X) 

X ^ resample(ti;, X) 

X ^ move(K^^gg, X) 

a ^ find step length(/9, X) 



(Section 5) 



(Procedure 3) 
(Procedure 4) 
(Procedure 2) 



w ^ importance weights(a, vr, X) (Procedure 1) 
p ^ Q + a 
end while 

return X]fc=i '^feiCfc (7) 

For an efficient implementation, we recommend to store tlie values 7r(iC]^), . . . , 7r(^c^) 
and qe{xi), . . . , qe{xn) to avoid unnecessary evaluations. When updating the latter set, 
we can exploit the fact that, in a systematically resampled particle system, multiple 
copies of the same particles are neighbours. 

5 Multivariate binary models 

hi this section, we address the choice of a multivariate binary parametric family qg with 
parameter ^ G needed to construct the independent Metropolis-Hastings kernel used 
in Procedure 4. 

5.1 Desired properties 

We first frame the properties making a parametric family suitable for our Sequential 
Monte Carlo algorithm. 

(a) For reasons of parsimony, we want to construct a family of distributions with at most 
dim(0) < d{d+l)/2 parameters. More complex families are usually computationally 
too expensive to handle. 

(b) Given a sample X = (a;i, . . . , £c„) from the target distribution vr, we want to estimate 
9* such that the binary model qg* is close to vr. For instance, 0* might be a maximum 
likelihood or method of moments estimator. 

(c) We want to generate independent samples from qg. If we can compute the conditional 
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or marginal distributions, we can write qg as 

d 

qeh) = qeMYlqehihi-.i^i) (n) 

i=2 
d 

= Qeili) n Qe{li:i)/qe{li:i-i)- 

i=2 

Using the chain rule decomposition (11), we can sample a random vector 'y ^ qe 
component-wise, conditioning on the entries we already generated. 

(d) We need to rapidly evaluate qei"^) for any 7 € B'^ in order to compute the Metropolis- 
Hastings ratio (5). 

(e) Analogously to the multivariate normal, we want our calibrated binary model qe* to 
produce samples with the mean and covariance of vr. If qg is not flexible enough to 
capture the dependence structure of vr, the Metropolis-Hastings kernel in Procedure 
4 cannot provide satisfactory acceptance rates for complex target distributions vr. 

In the following we construct a suitable parametric family and explain how to deploy it 
in Algorithm 2. 

Most of the literature on binary data stems from response models, multi-way con- 
tingency tables and multivariate interaction theory (Cox, 1972). For completeness, we 
append a brief list of other binary models mentioned in the literature which fail, for 
various reasons, to work in Sequential Monte Carlo applications. Providing paramet- 
ric families which meet the above requirements in high dimensions is a difficult task 
and understanding the shortcomings of alternative approaches an important part of the 
discussion. 

5.2 The logistic conditionals model 

In the previous paragraph, we already mentioned that a factorization (11) of the mass 
function qe{'^) into conditional distributions permits to sample from the parametric 
family. Unfortunately, for a complex d-dimensional binary model, we usually cannot 
calculate closed-form expressions for the conditional or marginal mass functions. 

We get around the computation of the marginal distributions of qe{'~i) if we directly 
fit univariate models qbiili \ li-.i-i) to the conditionals 7r(7j | 7i:i-i) of the target 
function. Qaqish (2003) suggested the use of linear regressions to model the conditional 
probabilities. This approach, however, does not guarantee that the fitted model is a 
valid distribution since the mass function might be negative. 

Construction of the model We propose to rather use logistic regressions for the con- 
ditional probabilities. Precisely, we adjust the univariate models 

^(Ftt (7i = 1 I 7i:i-i)) := + I]j=i ^i,i7j, i = l,...,d 

where ^{p) = logp — log(l — p). In the context of our Sequential Monte Carlo application, 
we take the particle system X and regress y*-*^ = Xj on the columns Z*-*^ = (Xi:j_i, 1), 

(i) 

where the column ' yields the intercept to complete the logistic model. 
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For a d-dimensional lower triangular matrix B, we define the logistic conditionals 
model as 

QBh) ■■= nti ^p(fe,,.+6,,i.-i7L_i)(^*) 

where p{y) = i~^{y) = (1 + exp(— is the logistic function. Recall that Bp{'y) = 
—p)^~'^ denotes the univariate Bernoulli distribution with expected value p G [0, 1]. 

There are d\ possible logistic regressions models and we arbitrarily pick one while there 
should be a parametrization which is optimal in a sense of nearness to the data Z. We 
observed, however, that permuting the components had, in practice, no impact on the 
quality of the approximation. 

Keep in mind that the number of observations in the logistic regressions is the size 
n of the particle system and typically very large. For instance, we run our numerical 
examples in Section 6 using n = 2 x 10^ particles. Therefore, the fit of the logistic 
regressions is usually very good. 

Sparse version The major drawback of any kind of multiplicative model is the fact that 
we have no closed-form likelihood-maximizers, and therefore the parameter estimation 
requires costly iterative fitting procedures. Therefore, even before discussing the fitting 
procedure, we construct a sparse version of the logistic conditionals model which we can 
estimate faster than the saturated model. 

Instead of fitting the saturated model q{'yi \ 7i:j_i), we preferably work with a more 
parsimonious regression model like q{'yi \ 7lJ for some index set Lj C {1, . . . , i — 1}, 
where the number of predictors i^Li is typically smaller than i — We solve this nested 
variable selection problem using some simple, fast to compute criterion. 

Given a weighted particle system w € [0, 1]", X € B"^*^, we denote for i,j € {1, . . . ,d} 
the weighted sample mean by 

and the weighted sample correlation by 

ri ■ = (14) 
Y^Xj(l — Xi)xj(\ — Xj) 

For £ = 0.02, we define the index set 

I := {i e {I,... ,d} \ Xi i (e,l-e)}. 

which identifies the components which have, according to particle system, a marginal 
probability close to either boundary of the unit interval. 

For the components i € /, we do not consider fitting a logistic regression, but set 
Lj = and draw them independently. Precisely, we set hi^i = i{xi) and = which 
corresponds to logistic model without predictors. Firstly, interactions do not really mat- 
ter if the marginal probability is excessively small or large. Secondly, these components 
are prone to cause complete separation in the data or might even be constant. 

For the conditional distribution of the remaining components = {1, . . . ,d} \ I, we 
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construct parsimonious logistic regressions. For 5 = 0.075, we define the predictor sets 



Li:={j e{l,...,i-l}\5 < \rij\}, i € 

which identifies the components with index smaller than i and significant mutual asso- 
ciation. Running our examples in Section 6 with 5 = show that a saturated logistic 
regression kernel achieves about the same acceptance rates as a sparse one, while setting 
6 = 0.075 dramatically reduces the computational time we need to calibrate the model. 

Fitting the model We maximise the log-likelihood function £{b) = £{b \ y, Z) of a 

weighted logistic regression model by solving the first order condition di/db = 0. We 
find a numerical solution via Newton- Raphson iterations 

starting at some ; see Procedure 5 for the exact terms. Other updating formulas like 
Iteratively Reweighted Least Squares or quasi-Newton iterations should work as well. 

Procedure 5 Fitting the weighted logistic regressions 

Input: w = {wi, . . . ,Wn), X = (a;i, . . . , a;„)T, B G M'^^'^ 
for i G I'^ do 

repeat 

Pk ^ ^"^(Zfc6('^~^)) for all A; = l,...,n 

Ik ^ Pk{'^ - Pk) for all /c = 1, . . . ,n 

^ (ZTdiag [w] diag [q] Z + x 

(ZTdiag [w]) (diag [q] Z b^'~'^^ + {y - p)^ 
until 16^.''^ - b^J'~'^^\ < 10-3 for ah j 

end for 
return B 

Sometimes, the Newton-Raphson iterations do not converge because the likelihood 
function is monotone and thus has no finite maximizer. This problem is caused by data 
with complete or quasi-complete separation in the sample points (Albert and Anderson, 
1984). There are several ways to handle this issue. 

(a) We just halt the algorithm after a fixed number of iterations and ignore the lack of 
convergence. Such proceeding, however, might cause uncontrolled numerical prob- 
lems. 

(b) In general. Firth (1993) recommends Jeffrey's prior but this option is computation- 
ally rather expensive. Instead, we might use a Gaussian prior with variance 1/e > 
which adds a quadratic penalty term eb^b to the log-likelihood to ensure the target- 
function is convex. 
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(c) As we notice that some terms of bi are growing beyond a certain threshold, we move 
the component i from the set of components with associated logistic regression model 
I'^ to the set of independent components /. 

In practice, we combine the approaches (c) and (d). In Procedure 5, we did not elaborate 
how to handle non-convergence, but added a penalty term to the log-likelihood, which 
causes the extra el„ in the Newton-Raphson update. Since we solve the update equation 
via Cholesky factorizations, adding a small term on the diagonal also ensures that the 
matrix is indeed numerically decomposable. 

Starting points The Newton-Raphson procedure is known to rapidly converge for start- 

(0) (*) 
ing values &• not too far from the solution . In absence of prior information about 

b[*\ we would naturally start with a vector of zeros and maybe setting bf*^ = £{xi). 

In the context of our Sequential Monte Carlo algorithm we can do better than that. 
Recall that, we constructed a smooth sequence {1^1)1=0 distributions which corresponds 
to a sequence of proposal distributions {qt)],=Q = {qet)t=o which is associated to a sequence 
{6t)l^Q of parameters. 

It significantly speeds up the Newton-Raphson procedure if we choose Bt_i as starting 
point for the estimation of B^. Indeed, towards the end of the Sequential Monte Carlo 
algorithm, we fit, for the same precision, a logistic regression in less than four iterations 
on average when starting at Bt_i, compared to about 13 iterations on average when 
starting at zero. 

Sampling and evaluating In the move step of Sequential Monte Carlo we discussed in 
Section 4.5, we need to sample a proposal state y from qq and evaluate the likelihood 
to compute the Metropolis-Hastings ratio 5. For the logistic regression model qb, 
we can do both in one go, see Procedure 6. 

Procedure 6 Sampling from the model 
Input: B 

for i = 1 . . . , d do 

sample 7j ~ Bq 

jpx q if = 1 

[px {1- q) if yi = 
end for 
return y, p 



5.3 Why not use a simpler model? 

We briefly justify why we should not use a simpler parametric family for our Sequential 
Monte Carlo application. Indisputably, the easiest parametric family on M'^ that we can 
think of is a product model 

Qph) ■■= nti^P.(7») 

where jBp,(a,)(7) = pi{x)'^[l — pi{x))^~'^ denotes a Bernoulli distribution with expected 
value Pi{x) € [0, 1]. 
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Figure 1: Toy example showing liow well the product model qp and the logistic regression model 
(7b replicate the mass function of a difficult posterior distribution tt. 



(a) true mass function 7r(7) (b) product model 9^(7) (c) logistic regression model 93(7) 




Let us check the requirement list: the product model is parsimonious with dim(0) = 
d; the maximum likelihood estimator 6* is the sample mean x = n~^^^^^Xk; the 
decomposition (11) holds trivially, which allows us to sample from qp and evaluate gp(7) 
in 0{d). 

Obviously, however, qp does not reproduce any dependencies we might observe in X. 
Could we just forget about this last point and use the product model for its simplicity? 

Toy example We take a simple linear relation Y = Vi + V2. For n = 100 and /i = 10, 

we draw normal variates 

f 1 ~ A/" (-/X, In) , U2 ~ A/" In) , y = Vi+V2 

where y is the vector of observations and 

Zi, Z2 ~ (^^1, (AiV4) In) , 2=3, Zi^M {v2, / ^) In) • 

four columns of covariates. 

The posterior distribution 7r(7) = 7r(')' | Z), using the prior distributions as de- 
scribed in Section 2, typically exhibits strong dependencies between its components due 
to the correlation in the data. 

Now we generate pseudo-random data Z from vr and fit both a product model qp and 
a logistic regression model q^. Looking at the corresponding mass function in Figure 
1, we notice how badly the product model mimics the true posterior. This observation 
carries over to larger sampling spaces. 

Acceptance rates A good way to analyse the importance of reproducing the dependen- 
cies of vr is in terms of acceptance rates and particle diversities. As we already remark in 
Section 4.5, the particle diversity naturally diminishes as our particle system approaches 
a strongly concentrated target distribution vr. However, we want our algorithm to keep 
up the particle diversity a long as possible to ensure the particle system is well spread 
out over the entire state space. 

In Figure 2, we show a comparison (based on the Boston Housing data set explained 
in Section 6.1) between two Sequential Monte Carlo algorithms, using a product model 
qp and a logistic regression model qb as proposal distribution of the Metropolis-Hastings 
kernel (5). 

Clearly, in Figure 2(a), the acceptance rates achieved by the product kernel rapidly 
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Figure 2: We compare the use of a product model qp to a logistic regression model (/b as proposal 
distribution of the Metropolis-Hastings kernel (5). We monitor a typical run [g on 
the X-axis) of our Sequential Monte Carlo algorithm (for the Boston Housing data set 
described in Section 6.1) and plot the acceptance rates and particle diversities (on 
the y-axis). 



(a) acceptance rates (b) particle diversities 




0,0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 



decrease and dwell around 5% for the second half of the run. In contrast, the logistic 
regression kernel always provides acceptance rates greater than 20%. As a consequence, 
in Figure 2(b), the particle diversity sustained by the product kernel decreases at an 
early stage, while the logistic regression kernel holds it up until the very last steps. 

At first sight, it might seem odd that the acceptance rates of the logistic regression 
kernel increase during the final steps of the algorithm. If we jump ahead, however, and 
take a look at the results of the Boston Housing problem, see Figure 3(a), we notice that 
quite a few marginal probabilities of the posterior vr turn out to be zero, which makes it 
easier to reproduce the distributions towards the end of the Resample-Move algorithm. 

However, if we already decide at an early stage that for some component P (7^ = 1) = 0, 
we fail to ever consider states 7 € B'^ with 7i = 1 for the rest of the algorithm. Therefore, 
the advantage of the logistic regression kernel over the simple product kernel is that we 
do not completely drop any components from the variable selection problem until the 
final steps. 

5.4 Review of alternative binary models 

In the following, we review some alternative approaches to modeling multivariate binary 
data. Unfortunately, we cannot incorporate any of these models in our Sequential Monte 
Carlo algorithm. Still, it is instructive to understand why alternative strategies fail to 
provide suitable proposal distributions in the sense of Section 5.1. For a more detailed 
review of parametric families suitable for adaptive Monte Carlo algorithms on binary 
spaces, see Schafer (2010). 

Quadratic multi-linear models For coefficients a G M^'*, we can write any mass function 
on B'^ as 

<l) = E5c{i,...,d} as Dies Ti- 
lt is tempting to construct a d{d -|- l)/2 parameter model 

9/., A (7) := ^ + 7^A7 
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by removing interaction terms of order higher than two. As Bahadur (1961) points out, 
the main problem of any additive approach is the fact that a truncated model might not 
be non-negative and thus not define a probability distribution. 

Although the linear structure allows to derive explicit and recursive formulae for the 
marginal and conditional distributions, we hardly ever find a useful application for the 
additive model. As other authors (Park et al., 1996; Emrich and Piedmonte, 1991) 
remark, additive representations like the much-cited Bahadur (1961) expansion are quite 
instructive but, unfortunately, impractical. 



Quadratic exponential models For coefficients o € , we can write any mass function 
on B'^ as 

7r(7) = exp {Esc{i,...,d} UieS 7*-) 
Removing higher order interaction terms, we can construct a d{d+l)/2 parameter model 

gA.,A(7) := Mexp(7TA7), (16) 

where A is a symmetric matrix. Quadratic exponential models are a well defined class of 
distributions, but there is no simple recursive structure for their marginal distributions. 
Hence, we cannot compute the factorization (11) we need to sample from q^. 

Cox and Wermuth (1994) propose an approximation to the marginal distributions by 
expressions of the form (16), omitting higher order terms in a Taylor expansion. If we 
write the parameter A as 

'A' 
b c 

the parameter of the marginal distribution qAi-a-iili-.d-i) is approximately given by 
Ai:rf_i « A' + (1 + tanh(f )) diag [b] + i sech^^ )6bT^ 

and the normalizing constant is ^i-a-i = Ai(l + exp(c)). We can recursively compute 
approximations to all marginal distributions gAi-d-i) • • • i ^Ai i and derive logistic forms 

^(P (7. = 1 I 71.-1)) = log ^^^4^^^^^' 

lA^Ali = 0,7l:i-l) 

which takes us back to (12). However, there is no reason to fit a quadratic exponential 
model and compute the approximate logistic model if we can directly fit the logistic 
conditionals model in the same time. 



Latent variable models Let be a parametric family on X and r : X ^ IB a mapping 
into the binary state space. We can sample from a latent variable model 

qeil) ■■= /r-i{-y) Veiv)dv 

by setting y = t{v) for a draw v ^ ipg from the latent parametric family. 

Non- normal parametric families with d{d — l)/2 dependence parameters seem to ei- 
ther have a very limited dependence structure or unfavourable properties (Joe, 1996). 
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Therefore, the multivariate normal 

-riv) = (l(oo,0](^^l), • • • ,l(oo,0](^^d)), 

appears to be the natural and almost the only option for pg. This kind of model has 
been discussed repeatedly in the literature (Emrich and Piedmonte, 1991; Leisch et al., 
1998; Cox and Wermuth, 2002). 

The first and second order marginal probabilities of the model Q(n^Yi) given by 
and 02{lJ'i, IJ-j] o'ij), respectively, where ^i(wi) and ^2ivi,Vj; aij) denote the cu- 
mulative distribution functions of the univariate and bivariate normal distributions with 
zero mean, unit variance and correlation aij G [—1,1]. 

We can fit the model Qi^n^s) to a particle system (id, X) by matching the moment, that 
is adjusting and S such that 

^lim) = Xi, ^i{fj.i,fj.j;aij) = rij 

with Xi and rij as defined in (13) and (14). However, the locally constructed correlation 
matrix XI might not be positive definite. Still, we can obtain a feasible parameter 
replacing 5] by XI* = (5] + |A| I)/(l + |A|), where A is smaller than all eigenvalues of the 
locally adjusted matrix XI. 

The main drawback of latent variable approaches is the fact that that point-wise 
evaluation of the probability mass function qg{y) is computationally feasible only in 
special cases. Hence, we cannot use this class of models in a Sequential Monte Carlo 
context. 

Archimedean copula models The potentials and pitfalls of applying copula theory, 
which is well developed for bivariate, continuous random variables, to multivariate dis- 
crete distribution is discussed in Genest and Neslehova (2007). There have been earlier 
attempts to sample binary vectors via copulae: Lee (1993) describes how to construct an 
Archimedean copula, more precisely the Frank family (Nelsen, 2006, p. 119), for sampling 
multivariate binary data. Unfortunately, this approach is limited to very low dimensions. 

Multivariate reduction models Several approaches to generating multivariate binary 
data are based on a representation of the components "fi as functions of sums of indepen- 
dent variables (Park et al., 1996; Lunn and Davies, 1998; Oman and Zucker, 2001). These 
techniques are limited to certain patterns of non-negative correlation, and do, therefore, 
not yield suitable proposal distributions in a Sequential Monte Carlo application. We 
mention them for the sake of completeness. 

6 Numerical experiments 

In this section we compare our Sequential Monte Carlo algorithm to standard Markov 
chain methods based on local moves as introduced in Section 3. These are standard 
algorithms and widely used. There are other recent approaches like Bayesian Adaptive 
Sampling (Clyde et al., 2011) or Evolutionary Stochastic Search (Bottolo and Richard- 
son, 2010) which also aim at overcoming the difficulties of multi-modal binary distribu- 
tions. However, a thorough and just comparison of our Sequential Monte Carlo approach 
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to other advanced methods needs careful consideration and is beyond the scope of this 
paper. 

For testing, we created variable selection problems with high dependencies between 
the covariates which yield particularly challenging, multi-modal posterior mass functions. 
The problems are build from freely available datasets by adding logarithms, polynomials 
and interaction terms. The Markov chain Monte Carlo methods presented in Section 3 
tend to fail on these problems due to the very strong multi-modality of the posterior 
distribution while the Sequential Monte Carlo approach we advocate in Section 4 yields 
very reliable results. 

Note, however, that using Sequential Monte Carlo we do not get something for noth- 
ing. Firstly, the implementation of our algorithm including the logistic conditionals 
model introduced in Section 5.2 is quite involved compared to standard Markov chain 
algorithms. Secondly, simple Markov chain methods are faster than our algorithm while 
producing results of the same accuracy if the components of the target distribution are 
nearly independent. 

6.1 Construction of the data sets 

We briefly describe the variable selection problems composed for our numerical experi- 
ments. 

Boston Housing The first example is based on the Boston Housing data set, originally 
treated by Harrison and Rubinfeld (1978), which is freely available at the StatLib data 
archive. The data set provides covariates ranging from the nitrogen oxide concentration 
to the per capita crime rate to explain the median prices of owner-occupied homes, see 
Table 1. The data has yet been treated by several authors, mainly because it provides 
a rich mixture of continuous and discrete variables, resulting in an interesting variable 
selection problem. 

Specifically, we aim at explaining the logarithm of the corrected median values of 
owner-occupied housing. We enhance the 13 columns of the original data set by adding 
first order interactions between all covariates. Further, we add a constant column and a 
squared version of each covariate (except for CHAS since it is binary). 

This gives us a model choice problem with 104 possible predictors and 506 observa- 
tions. We use a hierarchical Bayesian approach, with priors as explained in the above 
Section 2, to construct a posterior distribution vr. By construction, there are strong de- 
pendencies between the possible predictors which leads to a rather complex, multi-modal 
posterior distribution. 

Concrete Compressive Strength The second example is constructed from a less known 
data set, originally treated by Yeh (1998), which is freely available at the UCI Machine 
Learning Repository. The data provides information about components of concrete to 
explain its compressive strength. The compressive strength appears to be a highly non- 
linear function of age and ingredients. 

In order to explain the compressive strength, we take the 8 covariates of the original 
data set and add the logarithms of some covariates (indicated by the prefix lg), see 
Table 2. Further, we add interactions between all 13 covariates of the augmented data 
set and a constant column. 
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This gives us a model choice problem with 79 possible predictors and 1030 observa- 
tions. We use a hierarchical Bayesian approach, with priors as explained in the above 
Section 2, to construct a posterior distribution n. 

Protein activity data The third example has originally been analyzed by Clyde and 
Parmigiani (1998). Later, Clyde et al. (2011) used it as a challenging example problem 
in variable selection and included the raw data in the R-package BAS available at CRAN 
which implements the Bayesian Adaptive Sampling algorithm. 

In order to explain the protein activity (prot.actI), we first convert the factors BUF, 
RA and DET into a factor model. We enhance the 14 columns of this data set by adding 
first order interactions between all covariates and a constant column. For details on the 
raw data see Table 3. 

Note that some columns turn out to be constant zeros such that we obtain a model 
choice problem with 88 possible predictors and 96 observations. For reasons of con- 
sistency, we choose the priors explained in the above Section 2 instead of the original 
g-phor used in Clyde et al. (2011). 

6.2 Main effect restrictions 

In some statistical applications we might want to only include the interactions if the 
corresponding main effects are present in the model. These constraints are easy to 
incorporate if needed but render the sampling problem even more challenging since the 
constrained support makes the state space exploration more difficult. 

Let d denote the number of main effects and 'jij the interaction of the main effects 7^ 
and 7j for all i,j = l,...,d. For the variable selection problem on B'^^'^^^^/^, we impose 
the prior constraints on the feasible interactions 

7r(7 I Z) oc l|^gigd(d+i)/2[^^^.<.^^^^, for all i,j=i,...,d}h)- (17) 

While the Markov chain Monte Carlo algorithms can proceed as before, we need to 
slightly modify our Sequential Monte Carlo approach. When sampling from a restricted 
distribution, we initialise the particle system with an iid sample from the prior (17) in- 
stead of the uniform distribution. In the sequel, we report for each dataset a comparison 
with and without the main effect restrictions. 

6.3 How to compare to Markov chain Monte Carlo 

We do not think it is reasonable to compare two completely different algorithms in 
terms of pure computational time. We cannot guarantee that our implementations are 
optimal nor that the time measurements can exactly be reproduced in other computing 
environments. 

We suppose that the number of evaluations of the target function vr is more of a 
fair stopping criterion, since it shows how well the algorithms exploit the information 
obtained from vr. Precisely, we parameterise the Sequential Monte Carlo algorithm to not 
exceed a fixed number v of evaluations and stop the Markov chains when 1/ evaluations 
have been performed. 

Assets and drawbacks The Sequential Monte Carlo and the Markov chain Monte Carlo 
algorithms both have extensions and numerical speed-ups which make it hard to settle 
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on a fair comparison. 

Advocates of Markov chain methods might criticise that the number of target eval- 
uations is a criterion biased towards the Sequential Monte Carlo approach, for there 
are updating schemes which allow for faster computation of the Cholesky decomposition 
(3) given the decomposition of a neighbouring model, see Dongarra et al. (1979, chaps. 
8,10). Thus, Markov chains which propose to change one component in each step can 
evaluate tt with less effort and perform more evaluations of vr in the same time. 

On the other hand, however, the Sequential Monte Carlo algorithm can be parallelised 
in the sense that we can, on suitable hardware, run many evaluations of vr in parallel 
during the move step, see Procedure 4. No analogue speed-up can be performed in the 
context of Markov chains. We have processed variable selection problems from genetics 
with about a thousand covariates within a few hours running a parallelised version 
of the algorithm on a 64-CPU cluster. A detailed report is going to be published as 
supplementary material. 

Further, Sequential Monte Carlo methods are more suitable than Markov chain Monte 
Carlo to approximate the evidence, that is the normalization constant of the posterior 
distribution. We can exploit this property to compare, for instance, regression models 
with different monotonic link functions. 

Parameters We run our Sequential Monte Carlo (SMC) algorithm with n = 1.5 x 10^ 
particles and a target effective sample size r] = 0.9, as explained in Section 4. For 
these parameters, the Sequential Monte Carlo algorithm needs less than = 2.5 x 10^ 
evaluations of vr on all examples problems. 

We compare our algorithm to both the Adaptive Markov chain Monte Carlo (Nott 
and Kohn, 2005, AMCMC) and the standard metropohsed Gibbs (Liu, 1996, MCMC) 
described in Section 3. For the MCMC, we draw the number of bits to be flipped from a 
truncated geometric distribution with mean k* = 2 as proposed in Section 3.3. However, 
as stated earlier, we could not observe a significant effect of changes in the block updating 
schemes on the quality of the Monte Carlo estimate. 

For the AMCMC, we use S = 0.01 and A = 0.01, following the recommendations of 
Nott and Kohn (2005). We update the estimates ^p and W every 2 x 10^ iterations of 
chain. Before we start adapting, we generate 2.5 x 10^ iterations with a metropohsed 
Gibbs kernel (after a discarded burn-in of 2.5 x 10^ iterations). 

6.4 Implementation 

The numerical work was completely done in Python 2.6 using SciPy packages and run 
on a cluster with 1.86GHz processors. Scientific work in applied fields is often more 
accessible to the reader if the source code which generated numerical evidence is released 
along with the publication. The complete, documented sources used in this work can be 
found at http://code.google.eom/p/snicdss. 

We also provide instructions on how to install and run our project. The program can 
process data sets in standard csv-format and generate R scripts for graphical visualisation 
of the results. The released version was tested to run on both Windows and Linux 
machines. 
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6.5 Results and discussion 

We run each algorithm 200 times and each time we obtain for all covariates a Monte 
Carlo estimate of the marginal probability of inclusion in the normal linear model. We 
visualize the variation of the estimator by box-plots that show how much the Monte 
Carlo estimates have varied throughout the 200 runs (Figures 3 and 5). Here, the white 
boxes contain 80% of the Monte Carlo results, while the black boxes show the extent 
of the 20% outliers. For better readability, we add a coloured bar up to the smallest 
estimate we obtained in the test runs; otherwise components with a small variation are 
hard to see. 

The vertical line in the white box indicates the median of the Monte Carlo estimates. 
The median of the Sequential Monte Carlo runs correspond very precisely to the results 
we obtained by running a Markov chain Monte Carlo algorithm for a few days. Unques- 
tionably, the Sequential Monte Carlo algorithm is extremely robust; for 200 test runs 
and for both data sets, the algorithm did not produce a single major outlier in any of 
the components. 

This not true for either of the Markov chain algorithms. The size of white boxes 
indicate that adaptive Markov chain Monte Carlo works quite better than the standard 
Markov chain procedure. However, even the adaptive Markov chain method is rather 
vulnerable to generating outliers. The large black boxes indicate that, for some starting 
points of the chain, the estimates of some marginal probabilities might be completely 
wrong. 

The outliers, that is the black boxes, in the MCMC and the AMCMC plots are strik- 
ingly similar. The adaptive and the standard Markov chains apparently both fall into the 
same trap, which in turn confirms the intuition that adaption makes a method faster but 
not more robust against outliers. An adaptive local method is still a local method and 
does not yield reliable estimates for difficult binary sampling problems. Figure 8 suggests 
that in constrained spaces adaption is difficult and might even have contra-productive 
effects. 

In Tables 3 to 8, we gather some key performance indicators, each averaged over the 
200 runs of the respective algorithms. Note that the time needed to perform 2.5 x 10^ 
evaluations of vr is a little less than the running time of the standard Markov chain. Thus, 
even in terms of computational time, the adaptive Markov chain can hardly compete 
with our Sequential Monte Carlo method, even if evaluations of vr were at no cost. 
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Figure 3: Boston Housing data set. For details see Section 6.5. 

(a) SMC - 1.4 X 10*^evarns of tt (b) AMCMC 2.5 x lO^eval'ns of tt (c) MCMC 2.5 x lO^eval'ns of tt 
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Table. Boston Housing data set. Averaged key indicators complementary to Figure 3. 





Sequential MC Adaptive MCMC Standard MCMC 


computational time 


: 36 : 59 h 


4 : 50 : 52 h 


: 38 : 06 h 


evaluations of vr 


1.36 X 10^ 


2.50 X 10^ 


2.50 X 10^ 


average acceptance rate 


36.4% 


29.1% 


0.81% 


length t of the chain Xt 




7.52 X 10^ 


2.50 X 10^ 


moves Xt ^ Xt-i 




7.28 X 10^ 


2.07 X 10"^ 
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Figure 4: Boston Housing data set witli main effect restrictions. For details see Section 6.5. 
(a) SMC ~ 1.2 X lO^eval'ns of tt (b) AMCMC 2.5 x lO^eval'ns of vr (c) MCMC 2.5 x lO'^evarns of vr 




Table. Boston Housing data set with main effect restrictions. Averaged key indicators 
complementary to Figure 4. 

I Sequential MC Adaptive MCMC Standard MCMC 



computational time 
evaluations of vr 
average acceptance rate 
length t of the chain Xt 
moves Xf ^ Xt^i 



: 18 : 05 h 
1.15 X 10^ 
20.79% 



4 : 33 : 20 h 
2.50 X 10^ 
45.4% 
8.01 X 10^ 
1.13 X 10^ 



: 14 : 13 h 
2.50 X 10^ 
1.20% 
2.50 X 10^ 
2.96 X 10^ 
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Figure 5: Concrete Compressive Strengtli data set. For details see Section 6.5. 
(a) SMC ~ 1.2 X lO^eval'ns of tt (b) AMCMC 2.5 x lO^eval'ns of vr (c) MCMC 2.5 x lO^eval'ns of vr 




Table. Concrete Compressive Strength data set. Averaged key indicators complementary to 
Figure 5. 



Sequential MC Adaptive MCMC Standard MCMC 



computational time 
evaluations of vr 
average acceptance rate 
length t of the chain Xt 
moves Xf ^ Xt^i 



: 29 : 01 min 
1.19 X 10^ 
30.7% 



02 : 06 min 
2.50 X 10^ 
70.4% 
2.43 X 10^ 
1.76 X 10^ 



: 43 : 17 min 
2.50 X 10^ 
7.20% 
2.50 X 10^ 
1.79 X 10^ 
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Figure 6: Concrete Compressive Strengtli data set witli main effect restrictions. For details see 
Section 6.5. 

(a) SMC ~ 2.4 X lO'^evarns of tt (b) AMCMC 2.5 x lO^eval'ns of tt (c) MCMC 2.5 x f O'^evaPHS of vr 








Table. Concrete Compressive Strength data set with main effect restrictions. Averaged fcey 
indicators complementary to Figure 6. 

I Sequential MC Adaptive MCMC Standard MCMC 



computational time 
evaluations of vr 
average acceptance rate 
length t of the chain Xt 
moves Xt Xt^i 



: 43 : 01 min 
2.42 X 10^ 
30.98% 



2 : 29 : 16 min 
2.50 X 10^ 
61.1% 
2.72 X 10^ 
1.53 X 10^ 



: 41 : 48 min 
2.50 X 10^ 
5.31% 
2.50 X 10^ 
1.32 X 10^ 
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Figure 7: Protein data set. For details see Section 6.5. 

(a) SMC ~ 6.1 X lO^eval'ns of tt (b) AMCMC 2.5 x lO^eval'ns of tt (c) MCMC 2.5 x lO'^evarns of vr 




Table. Protein data set. Averaged key indicators complementary to Figure 7. 

I Sequential MC Adaptive MCMC Standard MCMC 



computational time 
evaluations of vr 
average acceptance rate 
length t of the chain Xt 
moves Xt ^ Xt-i 







14 : 55 min 
6.17 X 10^ 
30.7% 



58 : 32 min 

2.50 X 10^ 
60.7% 

9.19 X lO'^ 

1.51 X 10^ 



: 29 : 38 min 
2.50 X 10^ 
1.20% 
2.50 X 10^ 
3.03 X 10^ 
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Figure 8: Protein data set witli main effect restrictions. For details see Section 6.5. 
(a) SMC ~ 6.1 X lO^eval'ns of tt (b) AMCMC 2.5 x lO^eval'ns of vr (c) MCMC 2.5 x lO^eval'ns of vr 




MQCISj detn 3 




Table. Protein data set with main effect restrictions. Averaged key indicators complementary 
to Figure 8. 



Sequential MC Adaptive MCMC Standard MCMC 



computational time 
evaluations of vr 
average acceptance rate 
length t of the chain Xt 
moves Xf ^ Xt^i 







14 : 45 min 
6.19 X 10^ 
26.65% 



32 : 06 min 
2.50 X 10^ 
22.3% 
1.07 X 10*^ 
5.56 X 10^ 







30 : 21 min 
2.50 X 10^ 
1.20% 
2.50 X 10^ 
3.03 X 10^ 
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Table 1: Boston Housing data summary. 

short name explanation 



CRIM per capita crime 

ZN proportions of residential land zoned 

for lots over 2323 m^ 
INDUS proportions of non-retail business acres 

CHAS tract borders Charles River (binary) 

NOX nitric oxides concentration (parts per 10^) 

RM average numbers of rooms per dwelling 

AGE proportions of owner-occupied units 

built prior to 1940 
DIS weighted distances to five Boston 

employment centres 
RAD accessibility to radial highways 

TAX full-value property-tax rate per USD 10^ 

PTRATio pupil-teacher ratios 
B (Bk — 0.63)^ where Bk is the proportion 

of the black population 
LSTAT percentage of lower status population 



Table 2: Concrete Compressive Strength data summary. Components arc measured as kg/m"^ 



short name explanation 



C, LG_C 
BLAST 
FASH 
W, LG_W 
PLAST 
CA, LG_CA 
FA, LG_FA 
AGE, LG_AGE 



cement 

blast furnace slag 

fly ash 

water 

superplasticizer 
coarse aggregate 
fine aggregate 
age in days 



Table 3: Protein activity data summary. 



short name explanation 

DET detergent 

BUF pH buffer 

NaCl salt 

CON protein concentration 

RA reducing agent 

MgCl2 magnesium chloride 

TEMP temperature 
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